Measurement Error and New Regression Methods: An Introduction
Introduction: this post surveys several regression methods. Total least squares and Deming regression address measurement error in both X and Y. Least angle regression, robust regression, isotonic regression, and locally weighted linear regression were proposed to make up for shortcomings of ordinary linear regression; although these methods are novel, they are not necessarily very popular.
Total Least Squares
Total least squares is a more refined least-squares formulation. It assumes that the regression (design) matrix itself is contaminated by noise and takes this into account when computing the least-squares solution, whereas ordinary least squares ignores this effect. Total least squares is widely used and generally gives good results.
Total least squares is a type of errors-in-variables regression, a least squares data modeling technique in which observational errors on both dependent and independent variables are taken into account. It is a generalization of Deming regression and also of orthogonal regression, and can be applied to both linear and non-linear models.
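As a concrete, hedged illustration (not from the original post), here is a minimal sketch of total least squares for a single predictor using the standard SVD construction in NumPy; the helper name tls_fit and the toy data are my own.

import numpy as np

def tls_fit(x, y):
    """Total least squares fit of y ≈ a + b*x, allowing errors in both x and y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x.mean(), y.mean()
    # Stack the centered variables and take the SVD of the joint data matrix.
    Z = np.column_stack([x - xm, y - ym])
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    # The right singular vector of the smallest singular value is normal to the TLS line.
    nx, ny = Vt[-1]
    b = -nx / ny          # slope
    a = ym - b * xm       # intercept
    return a, b

# Toy usage: noise in both coordinates.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 50)
x = t + rng.normal(scale=0.5, size=t.size)
y = 2 * t + 1 + rng.normal(scale=0.5, size=t.size)
print(tls_fit(x, y))      # slope should come out close to 2, intercept close to 1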
Deming Regression
1. Deming regression applies when:
(1) The measurement errors of the two assays are independent and normally distributed with mean zero.
(2) The error variances are proportional (within the measuring range the variances need not be constant).
2. Passing-Bablok regression applies when:
(1) The measurement errors of the two assays follow the same distribution (not necessarily normal).
(2) The ratio of the variances is constant (within the measuring range the variances need not be constant).
(3) The samples may be arbitrarily distributed. (A small sketch of the Passing-Bablok estimator follows this list.)
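For contrast with Deming regression, the following is a small, hedged sketch of the Passing-Bablok estimator mentioned above (all pairwise slopes plus a shifted median); the function name passing_bablok and the simplified handling of ties are my own and are not part of the original post.

import numpy as np

def passing_bablok(x, y):
    # All pairwise slopes S_ij = (y_j - y_i) / (x_j - x_i); pairs with equal x are skipped
    # and slopes of exactly -1 are discarded, as in the original procedure.
    x, y = np.asarray(x, float), np.asarray(y, float)
    slopes = []
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[j] - x[i]
            if dx != 0:
                s = (y[j] - y[i]) / dx
                if s != -1:
                    slopes.append(s)
    slopes = np.sort(slopes)
    N = len(slopes)
    K = int(np.sum(slopes < -1))   # offset that keeps the slope estimate approximately unbiased
    if N % 2:                      # shifted median of the slopes
        b = slopes[(N + 1) // 2 + K - 1]
    else:
        b = 0.5 * (slopes[N // 2 + K - 1] + slopes[N // 2 + K])
    a = np.median(y - b * x)       # intercept: median of the residual offsets
    return a, b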
Deming regression, named after W. Edwards Deming, is an errors-in-variables model which tries to find the line of best fit for a two-dimensional dataset. It differs from simple linear regression in that it accounts for errors in observations on both the x- and the y-axis. It is a special case of total least squares, which allows for any number of predictors and a more complicated error structure.
Deming regression is equivalent to the maximum likelihood estimation of an errors-in-variables model in which the errors for the two variables are assumed to be independent and normally distributed, and the ratio of their variances, denoted δ, is known. In practice, this ratio might be estimated from related data sources; however, the regression procedure takes no account of possible errors in estimating this ratio.
Deming regression: http://pan.baidu.com/s/1bptPygf
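When the error-variance ratio δ is known, the maximum-likelihood Deming slope has a simple closed form. Below is a hedged sketch of that formula in Python; the helper deming_fit is mine and is unrelated to the linked download.

import numpy as np

def deming_fit(x, y, delta=1.0):
    """Deming regression of y on x, with delta = var(y errors) / var(x errors) assumed known."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x.mean(), y.mean()
    sxx = np.mean((x - xm) ** 2)
    syy = np.mean((y - ym) ** 2)
    sxy = np.mean((x - xm) * (y - ym))
    # Closed-form maximum-likelihood slope; with delta = 1 this coincides with
    # orthogonal (total least squares) regression on a single predictor.
    slope = (syy - delta * sxx + np.sqrt((syy - delta * sxx) ** 2 + 4 * delta * sxy ** 2)) / (2 * sxy)
    intercept = ym - slope * xm
    return intercept, slope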
Least Angle Regression
Proposed by Efron in 2004, least angle regression (LARS) is a variable-selection method similar in form to forward stepwise regression. In terms of its solution process, it is also an efficient way of solving lasso regression.
The advantages of the LARS method are:
1. It is computationally just as fast as forward selection.
2. It produces a full piecewise linear solution path, which is useful in cross-validation or similar attempts to tune the model.
3. If two variables are almost equally correlated with the response, then their coefficients should increase at approximately the same rate. The algorithm thus behaves as intuition would expect, and also is more stable.
4. It is easily modified to produce solutions for other estimators, like the lasso.
5. It is effective in contexts where p > n (i.e., when the number of dimensions is significantly greater than the number of points).
The disadvantages of the LARS method include:
1. With any amount of noise in the dependent variable and with high dimensional multicollinear independent variables, there is no reason to believe that the selected variables will have a high probability of being the actual underlying causal variables. This problem is not unique to LARS, as it is a general problem with variable selection approaches that seek to find underlying deterministic components. Yet, because LARS is based upon an iterative refitting of the residuals, it would appear to be especially sensitive to the effects of noise. This problem is discussed in detail by Weisberg in the discussion section of the Efron et al. (2004) Annals of Statistics article. Weisberg provides an empirical example, based on a re-analysis of data originally used to validate LARS, showing that the variable selection appears to have problems with highly correlated variables.
2. Since almost all high dimensional data in the real world will just by chance exhibit some fair degree of collinearity across at least some variables, the problem that LARS has with correlated variables may limit its application to high dimensional data.
Least angle regression is a compromise between the forward stagewise algorithm and forward selection: it retains much of the accuracy of the forward stagewise algorithm while simplifying its step-by-step iteration.
Least angle regression is a regression algorithm well suited to high-dimensional data. Its main advantages are:
1) It is particularly suitable when the number of features n is much larger than the number of samples m.
2) Its worst-case computational complexity is similar to that of least squares, but it runs almost as fast as forward selection.
3) It produces the complete path of piecewise-linear solutions, which is extremely useful for cross-validating the model.
Its main disadvantage is:
Because the direction of each LARS iteration is determined by the residuals of the target, the algorithm is very sensitive to noise in the samples (see the sketch after this list).
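For readers who want to experiment, here is a hedged sketch using scikit-learn's lars_path, one publicly available implementation (the post itself does not prescribe a library); it computes the full piecewise-linear coefficient path mentioned above.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

# Synthetic data with a few informative features among many.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=1.0, random_state=0)

# method="lar" gives plain least angle regression; method="lasso" gives the lasso path via LARS.
alphas, active, coefs = lars_path(X, y, method="lasso")

print(coefs.shape)   # (n_features, n_steps): the full piecewise-linear solution path
print(active)        # order in which variables enter the model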
Summary
Lasso regression was developed on the basis of ridge regression. If the model has very many features that need to be shrunk, lasso regression is a good choice; in ordinary situations, a plain linear regression model is usually sufficient.
In addition, this post does not cover how least angle regression computes the actual θ parameter values; it only touches on the idea. If you are interested in the detailed derivation, see Bradley Efron's paper "Least Angle Regression", which is easy to find online.
Isotonic Regression
An example of applying isotonic regression to generated data: isotonic regression finds a non-decreasing approximation of a function on the training data while minimizing the mean squared error. The benefit of such a model is that it does not assume any particular form (such as linearity) for the target function. For comparison, a linear regression can be used as a reference.
Isotonic regression is a regression algorithm. Given a finite set of real numbers Y = {y1, …, yn} representing the observed responses and X = {x1, …, xn} the unknown response values to be fitted, it looks for a function that minimizes

f(x) = Σᵢ wᵢ (yᵢ − xᵢ)²   subject to   x₁ ≤ x₂ ≤ … ≤ xₙ

where X is ordered and the weights wᵢ are positive. The resulting function is called the isotonic regression and it is unique. It can be viewed as a least-squares problem under an ordering constraint; essentially, isotonic regression fits the best monotone function to the original data.
MLlib supports a parallelized isotonic regression algorithm. Isotonic regression has a single parameter, isotonic, whose default value is true; it specifies whether the fit is isotonic (monotonically increasing) or antitonic (monotonically decreasing).
The result of isotonic regression is treated as a piecewise linear function, and prediction follows these rules (illustrated by the sketch after this list):
1. If the prediction input exactly matches a training feature, the associated prediction is returned; if several predictions share that feature, one of them is returned.
2. If the prediction input is lower or higher than all training features, the prediction for the lowest or highest training feature, respectively, is returned; if several predictions share that boundary feature, the same rule as above applies.
3. If the prediction input falls between two training features, the prediction is treated as a piecewise linear function, and its value is interpolated from the predictions of the two closest training features.
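The rules above describe Spark MLlib's behavior; as an analogous, hedged illustration in Python, here is a sketch using scikit-learn's IsotonicRegression, whose predictions likewise interpolate between training features and (with out_of_bounds="clip") clamp inputs outside the training range.

import numpy as np
from sklearn.isotonic import IsotonicRegression

x = np.arange(10, dtype=float)
y = np.array([1.0, 0.5, 2.0, 1.8, 3.5, 3.0, 4.2, 5.0, 4.8, 6.1])

# increasing=True fits a monotonically increasing function (the "isotonic" case);
# increasing=False would fit the antitonic (monotonically decreasing) case.
iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
y_fit = iso.fit_transform(x, y)

print(y_fit)                           # non-decreasing approximation of y
print(iso.predict([2.5, -3.0, 15.0]))  # linear interpolation inside, clipping outside the range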
Robust Regression
Robust regression is a technique from robust estimation in statistics. Its main idea is to modify the objective function of classical least-squares regression, which is very sensitive to outliers. Classical least squares takes the sum of squared errors as its objective; because the variance is not a robust statistic, least-squares regression is not a robust method. Different objective functions define different robust regression methods; common ones include least median of squares (LMS) and M-estimation.
Robust regression is a general term for a family of methods aimed mainly at handling outliers: the goal is to detect outlying points and to give a robust estimate of the model in their presence.
The basic idea is to assign different weights to different data points: points with small residuals receive larger weights and points with large residuals receive smaller weights, so as to reduce the influence of outliers on the model.
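As a small, hedged sketch of this down-weighting idea, the example below compares ordinary least squares with scikit-learn's HuberRegressor, one common M-estimator implementation (the post does not name a specific library), on data containing outliers.

import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 10, size=(100, 1)), axis=0)
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=1.0, size=100)
y[-5:] -= 60.0   # corrupt the five right-most points with gross outliers

ols = LinearRegression().fit(X, y)
# The Huber loss is quadratic for small residuals and linear for large ones,
# which effectively down-weights points with large residuals.
huber = HuberRegressor().fit(X, y)

print("OLS slope:  ", ols.coef_[0])    # noticeably distorted by the outliers
print("Huber slope:", huber.coef_[0])  # stays much closer to the true slope of 3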
Locally Weighted Linear Regression
Locally weighted linear regression is a non-parametric learning algorithm, whereas ordinary linear regression is a parametric learning algorithm: in the latter the parameters are fixed once fitted, while in locally weighted linear regression the parameters change with each prediction point.
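To make the parameter-per-query-point idea concrete, here is a minimal sketch of locally weighted linear regression with a Gaussian kernel; the bandwidth tau and the helper name lwlr_predict are my own choices for illustration.

import numpy as np

def lwlr_predict(x0, x, y, tau=0.5):
    """Predict at query point x0 by solving a weighted least squares problem
    whose weights decay with distance from x0 (Gaussian kernel, bandwidth tau)."""
    X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept
    w = np.exp(-((x - x0) ** 2) / (2 * tau ** 2))  # local weights: recomputed for every query point
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return theta[0] + theta[1] * x0

# Toy usage: the local fit adapts to a nonlinear target.
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x) + np.random.default_rng(1).normal(scale=0.1, size=x.size)
print([round(lwlr_predict(q, x, y), 3) for q in (1.0, 3.0, 5.0)])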
Note: please contact 圈圈 before reposting.